{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Out-of-the-box `Comparisons` for specific data types\n",
        "\n",
        "Splink has pre-defined `Comparison`s available for variety of data types.\n",
        "\n",
        "<hr>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## DateOfBirthComparison\n",
        "\n",
        "You can find full API docs for `DateOfBirthComparison` [here](../../api_docs/comparison_library.md#splink.comparison_library.DateOfBirthComparison)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "import splink.comparison_library as cl\n",
        "\n",
        "date_of_birth_comparison = cl.DateOfBirthComparison(\n",
        "    \"date_of_birth\",\n",
        "    input_is_string=True,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can view the structure of the comparison as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'DateOfBirthComparison' of \"date_of_birth\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - 'transformed date_of_birth is NULL' with SQL rule: try_strptime(\"date_of_birth_l\", '%Y-%m-%d') IS NULL OR try_strptime(\"date_of_birth_r\", '%Y-%m-%d') IS NULL\n",
            "    - 'Exact match on date of birth' with SQL rule: \"date_of_birth_l\" = \"date_of_birth_r\"\n",
            "    - 'DamerauLevenshtein distance <= 1' with SQL rule: damerau_levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1\n",
            "    - 'Abs date difference <= 1 month' with SQL rule: ABS(EPOCH(try_strptime(\"date_of_birth_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"date_of_birth_r\", '%Y-%m-%d'))) <= 2629800.0\n",
            "    - 'Abs date difference <= 1 year' with SQL rule: ABS(EPOCH(try_strptime(\"date_of_birth_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"date_of_birth_r\", '%Y-%m-%d'))) <= 31557600.0\n",
            "    - 'Abs date difference <= 10 year' with SQL rule: ABS(EPOCH(try_strptime(\"date_of_birth_l\", '%Y-%m-%d')) - EPOCH(try_strptime(\"date_of_birth_r\", '%Y-%m-%d'))) <= 315576000.0\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(date_of_birth_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see this as a specifications dictionary you can use:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "tags": [
          "hide_output"
        ]
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'output_column_name': 'date_of_birth',\n",
              " 'comparison_levels': [{'sql_condition': 'try_strptime(\"date_of_birth_l\", \\'%Y-%m-%d\\') IS NULL OR try_strptime(\"date_of_birth_r\", \\'%Y-%m-%d\\') IS NULL',\n",
              "   'label_for_charts': 'transformed date_of_birth is NULL',\n",
              "   'is_null_level': True},\n",
              "  {'sql_condition': '\"date_of_birth_l\" = \"date_of_birth_r\"',\n",
              "   'label_for_charts': 'Exact match on date of birth'},\n",
              "  {'sql_condition': 'damerau_levenshtein(\"date_of_birth_l\", \"date_of_birth_r\") <= 1',\n",
              "   'label_for_charts': 'DamerauLevenshtein distance <= 1'},\n",
              "  {'sql_condition': 'ABS(EPOCH(try_strptime(\"date_of_birth_l\", \\'%Y-%m-%d\\')) - EPOCH(try_strptime(\"date_of_birth_r\", \\'%Y-%m-%d\\'))) <= 2629800.0',\n",
              "   'label_for_charts': 'Abs date difference <= 1 month'},\n",
              "  {'sql_condition': 'ABS(EPOCH(try_strptime(\"date_of_birth_l\", \\'%Y-%m-%d\\')) - EPOCH(try_strptime(\"date_of_birth_r\", \\'%Y-%m-%d\\'))) <= 31557600.0',\n",
              "   'label_for_charts': 'Abs date difference <= 1 year'},\n",
              "  {'sql_condition': 'ABS(EPOCH(try_strptime(\"date_of_birth_l\", \\'%Y-%m-%d\\')) - EPOCH(try_strptime(\"date_of_birth_r\", \\'%Y-%m-%d\\'))) <= 315576000.0',\n",
              "   'label_for_charts': 'Abs date difference <= 10 year'},\n",
              "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
              " 'comparison_description': 'DateOfBirthComparison'}"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "date_of_birth_comparison.get_comparison(\"duckdb\").as_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-3-providing-the-spec-as-a-dictionary), if desired.\n",
        "\n",
        "<hr>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Name Comparison\n",
        "\n",
        "A Name comparison is intended for use on an individual name column (e.g. forename, surname) \n",
        "\n",
        "You can find full API docs for `NameComparison` [here](../../api_docs/comparison_library.md#splink.comparison_library.NameComparison)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {},
      "outputs": [],
      "source": [
        "import splink.comparison_library as cl\n",
        "\n",
        "first_name_comparison = cl.NameComparison(\"first_name\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'NameComparison' of \"first_name\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
            "    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
            "    - 'Jaro-Winkler distance of first_name >= 0.92' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.92\n",
            "    - 'Jaro-Winkler distance of first_name >= 0.88' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.88\n",
            "    - 'Jaro-Winkler distance of first_name >= 0.7' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The [NameComparison](../../api_docs/comparison_library.md)  also allows flexibility to change the parameters and/or fuzzy matching comparison levels.\n",
        "\n",
        "For example:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'NameComparison' of \"surname\" and \"surname_dmeta\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - 'surname is NULL' with SQL rule: \"surname_l\" IS NULL OR \"surname_r\" IS NULL\n",
            "    - 'Exact match on surname' with SQL rule: \"surname_l\" = \"surname_r\"\n",
            "    - 'Jaro-Winkler distance of surname >= 0.95' with SQL rule: jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.95\n",
            "    - 'Jaro-Winkler distance of surname >= 0.9' with SQL rule: jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9\n",
            "    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect(\"surname_dmeta_l\", \"surname_dmeta_r\")) >= 1\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "surname_comparison = cl.NameComparison(\n",
        "    \"surname\",\n",
        "    jaro_winkler_thresholds=[0.95, 0.9],\n",
        "    dmeta_col_name=\"surname_dmeta\",\n",
        ")\n",
        "print(surname_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Where `surname_dm` refers to a column which has used the DoubleMetaphone algorithm on `surname` to give a phonetic spelling. This helps to catch names which sounds the same but have different spellings (e.g. Stephens vs Stevens). For more on Phonetic Transformations, see the [topic guide](./phonetic.md)."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see this as a specifications dictionary you can call"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "tags": [
          "hide_output"
        ]
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'output_column_name': 'surname',\n",
              " 'comparison_levels': [{'sql_condition': '\"surname_l\" IS NULL OR \"surname_r\" IS NULL',\n",
              "   'label_for_charts': 'surname is NULL',\n",
              "   'is_null_level': True},\n",
              "  {'sql_condition': '\"surname_l\" = \"surname_r\"',\n",
              "   'label_for_charts': 'Exact match on surname',\n",
              "   'tf_adjustment_column': 'surname',\n",
              "   'tf_adjustment_weight': 1.0},\n",
              "  {'sql_condition': 'jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.95',\n",
              "   'label_for_charts': 'Jaro-Winkler distance of surname >= 0.95'},\n",
              "  {'sql_condition': 'jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.9',\n",
              "   'label_for_charts': 'Jaro-Winkler distance of surname >= 0.9'},\n",
              "  {'sql_condition': 'array_length(list_intersect(\"surname_dmeta_l\", \"surname_dmeta_r\")) >= 1',\n",
              "   'label_for_charts': 'Array intersection size >= 1'},\n",
              "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
              " 'comparison_description': 'NameComparison'}"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "surname_comparison.get_comparison(\"duckdb\").as_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-3-providing-the-spec-as-a-dictionary), if desired.\n",
        "\n",
        "<hr>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Forename and Surname Comparison\n",
        "\n",
        "\n",
        "It can be helpful to construct a single comparison for for comparing the forename and surname because:\n",
        "\n",
        "1. The Fellegi-Sunter model **assumes that columns are independent**. We know that forename and surname are usually correlated given the regional variation of names etc, so considering then in a single comparison can help to create better models.\n",
        "\n",
        "    As a result **term-frequencies** of individual forename and surname individually does not necessarily reflect how common the combination of forename and surname are.  For more information on term-frequencies, see the dedicated [topic guide](term-frequency.md). Combining forename and surname in a single comparison can allows the model to consider the joint term-frequency as well as individual.\n",
        "\n",
        "2. It is common for some records to have **swapped forename and surname by mistake**. Addressing forename and surname in a single comparison can allows the model to consider these name inversions.\n",
        "\n",
        "The `ForenameSurnameComparison` has been designed to accomodate this.\n",
        "\n",
        "You can find full API docs for `ForenameSurnameComparison` [here](../../api_docs/comparison_library.md#splink.comparison_library.ForenameSurnameComparison)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {},
      "outputs": [],
      "source": [
        "import splink.comparison_library as cl\n",
        "\n",
        "full_name_comparison = cl.ForenameSurnameComparison(\"forename\", \"surname\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'ForenameSurnameComparison' of \"forename\" and \"surname\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - '(forename is NULL) AND (surname is NULL)' with SQL rule: (\"forename_l\" IS NULL OR \"forename_r\" IS NULL) AND (\"surname_l\" IS NULL OR \"surname_r\" IS NULL)\n",
            "    - '(Exact match on forename) AND (Exact match on surname)' with SQL rule: (\"forename_l\" = \"forename_r\") AND (\"surname_l\" = \"surname_r\")\n",
            "    - 'Match on reversed cols: forename and surname' with SQL rule: \"forename_l\" = \"surname_r\" AND \"forename_r\" = \"surname_l\"\n",
            "    - '(Jaro-Winkler distance of forename >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)' with SQL rule: (jaro_winkler_similarity(\"forename_l\", \"forename_r\") >= 0.92) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.92)\n",
            "    - '(Jaro-Winkler distance of forename >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)' with SQL rule: (jaro_winkler_similarity(\"forename_l\", \"forename_r\") >= 0.88) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.88)\n",
            "    - 'Exact match on surname' with SQL rule: \"surname_l\" = \"surname_r\"\n",
            "    - 'Exact match on forename' with SQL rule: \"forename_l\" = \"forename_r\"\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(full_name_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As noted in the [feature engineering guide](../data_preparation//feature_engineering.md), to take advantage of term frequency adjustments on full name, you need to derive a full name column prior to importing data into Splink.  You then provide the column name using the `forename_surname_concat_col_name` argument:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {},
      "outputs": [],
      "source": [
        "full_name_comparison = cl.ForenameSurnameComparison(\"forename\", \"surname\", forename_surname_concat_col_name=\"first_and_last_name\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see this as a specifications dictionary you can call"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "tags": [
          "hide_output"
        ]
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'output_column_name': 'forename_surname',\n",
              " 'comparison_levels': [{'sql_condition': '(\"forename_l\" IS NULL OR \"forename_r\" IS NULL) AND (\"surname_l\" IS NULL OR \"surname_r\" IS NULL)',\n",
              "   'label_for_charts': '(forename is NULL) AND (surname is NULL)',\n",
              "   'is_null_level': True},\n",
              "  {'sql_condition': '\"first_and_last_name_l\" = \"first_and_last_name_r\"',\n",
              "   'label_for_charts': 'Exact match on first_and_last_name',\n",
              "   'tf_adjustment_column': 'first_and_last_name',\n",
              "   'tf_adjustment_weight': 1.0},\n",
              "  {'sql_condition': '\"forename_l\" = \"surname_r\" AND \"forename_r\" = \"surname_l\"',\n",
              "   'label_for_charts': 'Match on reversed cols: forename and surname'},\n",
              "  {'sql_condition': '(jaro_winkler_similarity(\"forename_l\", \"forename_r\") >= 0.92) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.92)',\n",
              "   'label_for_charts': '(Jaro-Winkler distance of forename >= 0.92) AND (Jaro-Winkler distance of surname >= 0.92)'},\n",
              "  {'sql_condition': '(jaro_winkler_similarity(\"forename_l\", \"forename_r\") >= 0.88) AND (jaro_winkler_similarity(\"surname_l\", \"surname_r\") >= 0.88)',\n",
              "   'label_for_charts': '(Jaro-Winkler distance of forename >= 0.88) AND (Jaro-Winkler distance of surname >= 0.88)'},\n",
              "  {'sql_condition': '\"surname_l\" = \"surname_r\"',\n",
              "   'label_for_charts': 'Exact match on surname',\n",
              "   'tf_adjustment_column': 'surname',\n",
              "   'tf_adjustment_weight': 1.0},\n",
              "  {'sql_condition': '\"forename_l\" = \"forename_r\"',\n",
              "   'label_for_charts': 'Exact match on forename',\n",
              "   'tf_adjustment_column': 'forename',\n",
              "   'tf_adjustment_weight': 1.0},\n",
              "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
              " 'comparison_description': 'ForenameSurnameComparison'}"
            ]
          },
          "execution_count": 12,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "full_name_comparison.get_comparison(\"duckdb\").as_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-3-providing-the-spec-as-a-dictionary), if desired."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<hr>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Postcode Comparisons\n",
        "\n",
        "See [Feature Engineering](../data_preparation/feature_engineering.md#postcodes) for more details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import splink.comparison_library as cl\n",
        "\n",
        "pc_comparison = cl.PostcodeComparison(\"postcode\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'PostcodeComparison' of \"postcode\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n",
            "    - 'Exact match on full postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n",
            "    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n",
            "    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')\n",
            "    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}', 0), '')\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(pc_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you have derived lat long columns, you can model geographical distances.  "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'PostcodeComparison' of \"postcode\", \"long\" and \"lat\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n",
            "    - 'Exact match on postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n",
            "    - 'Exact match on transformed postcode' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n",
            "    - 'Distance in km <= 1' with SQL rule:  cast( acos( case when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) > 1 then 1 when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) < -1 then -1 else ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) end ) * 6371 as float ) <= 1\n",
            "    - 'Distance in km <= 10' with SQL rule:  cast( acos( case when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) > 1 then 1 when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) < -1 then -1 else ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) end ) * 6371 as float ) <= 10\n",
            "    - 'Distance in km <= 50' with SQL rule:  cast( acos( case when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) > 1 then 1 when ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) < -1 then -1 else ( sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) + cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") ) * cos( radians(\"long_r\" - \"long_l\") ) ) end ) * 6371 as float ) <= 50\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "\n",
        "pc_comparison = cl.PostcodeComparison(\"postcode\", lat_col=\"lat\", long_col=\"long\", km_thresholds=[1,10,50])\n",
        "print(pc_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see this as a specifications dictionary you can call"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": [
          "hide_output"
        ]
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'output_column_name': 'postcode',\n",
              " 'comparison_levels': [{'sql_condition': '\"postcode_l\" IS NULL OR \"postcode_r\" IS NULL',\n",
              "   'label_for_charts': 'postcode is NULL',\n",
              "   'is_null_level': True},\n",
              "  {'sql_condition': '\"postcode_l\" = \"postcode_r\"',\n",
              "   'label_for_charts': 'Exact match on postcode'},\n",
              "  {'sql_condition': 'NULLIF(regexp_extract(\"postcode_l\", \\'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\\', 0), \\'\\') = NULLIF(regexp_extract(\"postcode_r\", \\'^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]\\', 0), \\'\\')',\n",
              "   'label_for_charts': 'Exact match on transformed postcode'},\n",
              "  {'sql_condition': '\\n        cast(\\n            acos(\\n                \\n        case\\n            when (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    ) > 1 then 1\\n            when (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    ) < -1 then -1\\n            else (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    )\\n        end\\n    \\n            ) * 6371\\n            as float\\n        )\\n     <= 1',\n",
              "   'label_for_charts': 'Distance in km <= 1'},\n",
              "  {'sql_condition': '\\n        cast(\\n            acos(\\n                \\n        case\\n            when (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    ) > 1 then 1\\n            when (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    ) < -1 then -1\\n            else (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    )\\n        end\\n    \\n            ) * 6371\\n            as float\\n        )\\n     <= 10',\n",
              "   'label_for_charts': 'Distance in km <= 10'},\n",
              "  {'sql_condition': '\\n        cast(\\n            acos(\\n                \\n        case\\n            when (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    ) > 1 then 1\\n            when (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    ) < -1 then -1\\n            else (\\n        sin( radians(\"lat_l\") ) * sin( radians(\"lat_r\") ) +\\n        cos( radians(\"lat_l\") ) * cos( radians(\"lat_r\") )\\n            * cos( radians(\"long_r\" - \"long_l\") )\\n    )\\n        end\\n    \\n            ) * 6371\\n            as float\\n        )\\n     <= 100',\n",
              "   'label_for_charts': 'Distance in km <= 100'},\n",
              "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
              " 'comparison_description': 'PostcodeComparison'}"
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "pc_comparison.get_comparison(\"duckdb\").as_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-3-providing-the-spec-as-a-dictionary), if desired."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<hr>"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Email Comparison\n",
        "\n",
        "You can find full API docs for `EmailComparison` [here](../../api_docs/comparison_library.md#splink.comparison_library.EmailComparison)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import splink.comparison_library as cl\n",
        "\n",
        "email_comparison = cl.EmailComparison(\"email\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Comparison 'EmailComparison' of \"email\".\n",
            "Similarity is assessed using the following ComparisonLevels:\n",
            "    - 'email is NULL' with SQL rule: \"email_l\" IS NULL OR \"email_r\" IS NULL\n",
            "    - 'Exact match on email' with SQL rule: \"email_l\" = \"email_r\"\n",
            "    - 'Exact match on username' with SQL rule: NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), '') = NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')\n",
            "    - 'Jaro-Winkler distance of email >= 0.88' with SQL rule: jaro_winkler_similarity(\"email_l\", \"email_r\") >= 0.88\n",
            "    - 'Jaro-Winkler >0.88 on username' with SQL rule: jaro_winkler_similarity(NULLIF(regexp_extract(\"email_l\", '^[^@]+', 0), ''), NULLIF(regexp_extract(\"email_r\", '^[^@]+', 0), '')) >= 0.88\n",
            "    - 'All other comparisons' with SQL rule: ELSE\n",
            "\n"
          ]
        }
      ],
      "source": [
        "print(email_comparison.get_comparison(\"duckdb\").human_readable_description)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see this as a specifications dictionary you can call"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "tags": [
          "hide_output"
        ]
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'output_column_name': 'email',\n",
              " 'comparison_levels': [{'sql_condition': '\\n        regexp_extract(\"email_l\", \\'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\\')\\n     IS NULL OR \\n        regexp_extract(\"email_r\", \\'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\\')\\n     IS NULL OR\\n                      \\n        regexp_extract(\"email_l\", \\'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\\')\\n    ==\\'\\' OR \\n        regexp_extract(\"email_r\", \\'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$\\')\\n     ==\\'\\'',\n",
              "   'label_for_charts': 'Null',\n",
              "   'is_null_level': True},\n",
              "  {'sql_condition': '\"email_l\" = \"email_r\"',\n",
              "   'label_for_charts': 'Exact match email'},\n",
              "  {'sql_condition': 'jaro_similarity(\"email_l\", \"email_r\") >= 0.8',\n",
              "   'label_for_charts': 'Jaro_similarity email >= 0.8'},\n",
              "  {'sql_condition': 'jaro_similarity(\"email_l\", \"email_r\") >= 0.8',\n",
              "   'label_for_charts': 'Jaro_similarity email >= 0.8'},\n",
              "  {'sql_condition': '\\n        regexp_extract(\"email_l\", \\'@([^@]+)$\\')\\n     = \\n        regexp_extract(\"email_r\", \\'@([^@]+)$\\')\\n    ',\n",
              "   'label_for_charts': 'Exact match Email Domain'},\n",
              "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
              " 'comparison_description': 'Exact match vs. Domain-only match vs.anything else'}"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "email_comparison.as_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "which can be used as the basis for a more custom comparison, as shown in the [Defining and Customising Comparisons topic guide ](customising_comparisons.ipynb#method-3-providing-the-spec-as-a-dictionary), if desired."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "base",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.8"
    },
    "orig_nbformat": 4
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
